Deep Learning - The Mathematics Behind Neural Network - Part 2

Note: This blog is part of a learn-along series, so there may be updates and changes as we progress.

In the previous blog, we covered the foundational concepts of neural networks. In this post, we work through the mathematics behind the basic neural network structure illustrated below:

Introduction to Neural Network Structure

  • Input Nodes: n_1 and n_2
  • Hidden Nodes: n_3 to n_8
  • Output Node: n_9

Each node is associated with a bias, denoted b_i, and each synapse (connection between nodes) has a weight, denoted w_i. The initial input values are x_1 and x_2, and \hat{y} represents the output produced by the network (its prediction).

Initializing Inputs, Weights, and Biases

First, let's give the inputs, weights, and biases initial values to better visualize what is happening.

x_1 = 0.23, x_2 = 0.55

w_1 = 0.1, w_2 = 0.2, w_3 = 0.3, w_4 = 0.4, w_5 = 0.5, w_6 = 0.6, w_7 = 0.7, w_8 = 0.8, w_9 = 0.9

w_{10} = 0.1, w_{11} = 0.2, w_{12} = 0.3, w_{13} = 0.4, w_{14} = 0.5, w_{15} = 0.6, w_{16} = 0.7, w_{17} = 0.8, w_{18} = 0.9

b_1 = 0.1, b_2 = -0.4, b_3 = 0.3, b_4 = -0.5, b_5 = 0.5, b_6 = 0.6, b_7 = -0.7
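If you'd like to follow along in code, here is a minimal Python sketch that sets up the same values. Keeping the weights and biases in dictionaries keyed by their indices is just my own bookkeeping choice for this post, not anything standard.

```python
# Inputs
x1, x2 = 0.23, 0.55

# Weights w1..w18 (same values as listed above)
w = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6, 7: 0.7, 8: 0.8, 9: 0.9,
     10: 0.1, 11: 0.2, 12: 0.3, 13: 0.4, 14: 0.5, 15: 0.6, 16: 0.7, 17: 0.8, 18: 0.9}

# Biases b1..b7
b = {1: 0.1, 2: -0.4, 3: 0.3, 4: -0.5, 5: 0.5, 6: 0.6, 7: -0.7}
```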

Forward Propagation Through Hidden Layers

Now, let's understand how the input values work their way through the neural network. Looking at node n_3 in the hidden layer, we can see that it receives two inputs through its two synapses. As we discussed in the previous blog, a node usually does two things with its inputs: it computes the total net input and then applies an activation function to that total to produce the node's output. Following common practice, we will use ReLU as the activation function for the nodes of the hidden layers and the sigmoid for the node(s) of the output layer.

net_{n_3} = w_1 \cdot x_1 + w_4 \cdot x_2 + b_1

net_{n_3} = 0.1 \cdot 0.23 + 0.4 \cdot 0.55 + 0.1 = 0.343

out_{n_3} = \max(0, net_{n_3})

out_{n_3} = \max(0, 0.343) = 0.343

Here are the outputs for the rest of the nodes in hidden layer 1:

out_{n_4} = \max(0, -0.079) = 0

out_{n_5} = \max(0, 0.699) = 0.699
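Continuing the Python sketch from above, the same layer-1 computations look like this (the commented values are approximate):

```python
def relu(net):
    # ReLU activation: max(0, net)
    return max(0.0, net)

# Total net input and output for each node in hidden layer 1
net_n3 = w[1] * x1 + w[4] * x2 + b[1]   # ~ 0.343
net_n4 = w[2] * x1 + w[5] * x2 + b[2]   # ~ -0.079
net_n5 = w[3] * x1 + w[6] * x2 + b[3]   # ~ 0.699

out_n3, out_n4, out_n5 = relu(net_n3), relu(net_n4), relu(net_n5)   # 0.343, 0.0, 0.699
```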

Now, the outputs of the nodes in hidden layer 1 become the inputs of the nodes in hidden layer 2, as shown in the diagram below.

After repeating the same steps for the nodes in hidden layer 2, we get the following outputs.

out_{n_6} = \max(0, 0.0197) = 0.0197

out_{n_7} = \max(0, 1.1239) = 1.1239

out_{n_8} = \max(0, 1.3281) = 1.3281
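In code, continuing the sketch and assuming the wiring implied by the calculations above (w_7, w_{10}, w_{13} feed n_6 from n_3, n_4, n_5, and so on for n_7 and n_8):

```python
# Hidden layer 2: the outputs of layer 1 are the inputs here
net_n6 = w[7] * out_n3 + w[10] * out_n4 + w[13] * out_n5 + b[4]   # ~ 0.0197
net_n7 = w[8] * out_n3 + w[11] * out_n4 + w[14] * out_n5 + b[5]   # ~ 1.1239
net_n8 = w[9] * out_n3 + w[12] * out_n4 + w[15] * out_n5 + b[6]   # ~ 1.3281

out_n6, out_n7, out_n8 = relu(net_n6), relu(net_n7), relu(net_n8)
```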

Computing Output Layer Activation

As mentioned above, we will be using the Sigmoid function as the activation function for the nodes in the output layer, which in this case is only one node.

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

net_{n_9} = 1.4082

\sigma(x) = \frac{1}{1 + e^{-x}}

\hat{y} = out_{n_9} = \sigma(net_{n_9}) = \frac{1}{1 + e^{-net_{n_9}}}

\hat{y} = \sigma(1.4082) = \frac{1}{1 + e^{-1.4082}} = 0.8035
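The same computation in the running Python sketch:

```python
import math

def sigmoid(net):
    # Logistic sigmoid: 1 / (1 + e^(-net))
    return 1.0 / (1.0 + math.exp(-net))

net_n9 = w[16] * out_n6 + w[17] * out_n7 + w[18] * out_n8 + b[7]   # ~ 1.4082
y_hat = sigmoid(net_n9)                                            # ~ 0.8035
```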

Calculating Total Error

Our next step is to calculate the total error of the neural network. This can be done using a variety of methods, but we will make use of the squared error with a multiplier of \frac{1}{2} so that the derivative we take later on is cleaner.

E(y, \hat{y}) = \frac{1}{2}\sum (y - \hat{y})^2

Here, y represents the ideal (target) output and \hat{y} represents the actual output of the network. Let's assume that y = 0.01 to continue with our explanation.

E(y, \hat{y}) = \frac{1}{2} (y - \hat{y})^2

E(0.01, 0.8035) = \frac{1}{2} (0.01 - 0.8035)^2 = 0.315

If we had more than one output, we would calculate the error for each output and sum them to get the total error. But since we have only one output, we can say that E_{\text{total}} = 0.315.
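In the running sketch, with the assumed target y = 0.01:

```python
y = 0.01                              # assumed target value from the text
E_total = 0.5 * (y - y_hat) ** 2      # ~ 0.315
```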

Backpropagation and Weight Updates

Next, we perform the backward pass, also known as backpropagation, which updates the weights and biases. This is done to bring the actual output closer to the target output, which reduces the total error in the process. Let's first try to update the weight w_{16}. Before we update it, we must know how much a change in w_{16} affects the total error E_{\text{total}}.

\frac{\partial E_{\text{total}}}{\partial w_{16}}

If we apply the chain rule to \frac{\partial E_{\text{total}}}{\partial w_{16}}, we get:

\frac{\partial E_{\text{total}}}{\partial w_{16}} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial w_{16}}

E_{\text{total}} = \frac{1}{2}\sum (y - \hat{y})^2 = \frac{1}{2} (y - out_{n_9})^2

\frac{\partial E_{\text{total}}}{\partial out_{n_9}} = out_{n_9} - y = 0.8035 - 0.01 = 0.7935

out_{n_9} = \frac{1}{1 + e^{-net_{n_9}}}

\frac{\partial out_{n_9}}{\partial net_{n_9}} = out_{n_9}(1 - out_{n_9}) = 0.8035 \cdot (1 - 0.8035) = 0.1579

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

\frac{\partial net_{n_9}}{\partial w_{16}} = out_{n_6} = 0.0197

Combining these we get:

\frac{\partial E_{\text{total}}}{\partial w_{16}} = (out_{n_9} - y) \cdot out_{n_9} \cdot (1 - out_{n_9}) \cdot out_{n_6}

\frac{\partial E_{\text{total}}}{\partial w_{16}} = 0.7935 \cdot 0.1579 \cdot 0.0197 = 0.0025

Now, we can update the weight w_{16} using the learning rate \eta:

w_{16}^{\text{new}} = w_{16}^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial w_{16}}

Assuming a learning rate \eta = 0.01:

w_{16}^{\text{new}} = 0.7 - 0.01 \cdot 0.0025 = 0.699975
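Continuing the Python sketch, the three factors and the update for w_{16} look like this. The updated value is kept in a separate variable so the remaining gradients in this post are still computed with the original weights, just as in the text:

```python
eta = 0.01                                      # learning rate

dE_dout9    = y_hat - y                         # ~ 0.7935
dout9_dnet9 = y_hat * (1 - y_hat)               # ~ 0.1579
dnet9_dw16  = out_n6                            # ~ 0.0197

dE_dw16 = dE_dout9 * dout9_dnet9 * dnet9_dw16   # ~ 0.0025
w16_new = w[16] - eta * dE_dw16                 # ~ 0.699975
```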

We can repeat this process for w_{17} and w_{18}:

w_{17}^{\text{new}} = 0.8 - 0.01 \cdot 0.1408 = 0.798592

w_{18}^{\text{new}} = 0.9 - 0.01 \cdot 0.1664 = 0.898336
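In code, the only factor that changes is the last one, the output of the node feeding the synapse:

```python
dE_dw17 = dE_dout9 * dout9_dnet9 * out_n7   # ~ 0.1408
dE_dw18 = dE_dout9 * dout9_dnet9 * out_n8   # ~ 0.1664

w17_new = w[17] - eta * dE_dw17             # ~ 0.798592
w18_new = w[18] - eta * dE_dw18             # ~ 0.898336
```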

Similarly, we update the biases. Let's start with b_7:

\frac{\partial E_{\text{total}}}{\partial b_7} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial b_7}

net_{n_9} = w_{16} \cdot out_{n_6} + w_{17} \cdot out_{n_7} + w_{18} \cdot out_{n_8} + b_7

\frac{\partial net_{n_9}}{\partial b_7} = 1

\frac{\partial E_{\text{total}}}{\partial b_7} = 0.7935 \cdot 0.1579 \cdot 1 = 0.1253

b_7^{\text{new}} = b_7^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial b_7}

b_7^{\text{new}} = -0.7 - 0.01 \cdot 0.1253 = -0.701253
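And in the running sketch:

```python
dE_db7 = dE_dout9 * dout9_dnet9 * 1.0   # ~ 0.1253
b7_new = b[7] - eta * dE_db7            # ~ -0.701253
```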

Iterative Training and Error Reduction

To update the weights and biases in the hidden layers, we need to propagate the error backward from the output layer. We'll start by calculating the partial derivatives for the weights of the synapses going into the second hidden layer, and then move on to the weights of the synapses going into the first hidden layer.

\frac{\partial E_{\text{total}}}{\partial w_7} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial out_{n_6}} \cdot \frac{\partial out_{n_6}}{\partial net_{n_6}} \cdot \frac{\partial net_{n_6}}{\partial w_7}

\frac{\partial E_{\text{total}}}{\partial out_{n_9}} = 0.7935

\frac{\partial out_{n_9}}{\partial net_{n_9}} = 0.1579

\frac{\partial net_{n_9}}{\partial out_{n_6}} = w_{16} = 0.7

\frac{\partial out_{n_6}}{\partial net_{n_6}} = \begin{cases} 1 & \text{if } net_{n_6} > 0 \\ 0 & \text{otherwise} \end{cases}

\frac{\partial out_{n_6}}{\partial net_{n_6}} = 1 \text{ (since } net_{n_6} = 0.0197 > 0\text{)}

\frac{\partial net_{n_6}}{\partial w_7} = out_{n_3} = 0.343

\frac{\partial E_{\text{total}}}{\partial w_7} = 0.7935 \cdot 0.1579 \cdot 0.7 \cdot 1 \cdot 0.343 = 0.03008

Now, update the weight w_7 using the learning rate \eta = 0.01:

w_7^{\text{new}} = w_7^{\text{old}} - \eta \cdot \frac{\partial E_{\text{total}}}{\partial w_7}

w_7^{\text{new}} = 0.7 - 0.01 \cdot 0.03008 = 0.6997
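Continuing the sketch, the five factors for w_7 and its update:

```python
dnet9_dout6 = w[16]                          # 0.7
dout6_dnet6 = 1.0 if net_n6 > 0 else 0.0     # 1.0, since net_n6 ~ 0.0197 > 0
dnet6_dw7   = out_n3                         # 0.343

dE_dw7 = dE_dout9 * dout9_dnet9 * dnet9_dout6 * dout6_dnet6 * dnet6_dw7   # ~ 0.03008
w7_new = w[7] - eta * dE_dw7                 # ~ 0.6997
```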

Similarly, for w_1 we apply the chain rule. Keep in mind that out_{n_3} feeds into n_6, n_7 and n_8, so the full derivative sums the contributions from all three paths; the term for the path through n_6 is:

\frac{\partial E_{\text{total}}}{\partial w_1} = \frac{\partial E_{\text{total}}}{\partial out_{n_9}} \cdot \frac{\partial out_{n_9}}{\partial net_{n_9}} \cdot \frac{\partial net_{n_9}}{\partial out_{n_6}} \cdot \frac{\partial out_{n_6}}{\partial net_{n_6}} \cdot \frac{\partial net_{n_6}}{\partial out_{n_3}} \cdot \frac{\partial out_{n_3}}{\partial net_{n_3}} \cdot \frac{\partial net_{n_3}}{\partial w_1}
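A sketch of that three-path sum, continuing the code above (the intermediate delta names are my own shorthand for \partial E / \partial net at each node):

```python
delta9 = dE_dout9 * dout9_dnet9                          # dE/dnet_n9
delta6 = delta9 * w[16] * (1.0 if net_n6 > 0 else 0.0)   # dE/dnet_n6
delta7 = delta9 * w[17] * (1.0 if net_n7 > 0 else 0.0)   # dE/dnet_n7
delta8 = delta9 * w[18] * (1.0 if net_n8 > 0 else 0.0)   # dE/dnet_n8

# Sum the contributions of the three downstream paths, then finish the chain down to w1
dE_dout3 = delta6 * w[7] + delta7 * w[8] + delta8 * w[9]
dE_dw1   = dE_dout3 * (1.0 if net_n3 > 0 else 0.0) * x1
w1_new   = w[1] - eta * dE_dw1
```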

By iterating this process (training), the total error decreases, and the neural network improves its task performance.
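To make that iteration concrete, here is a small, self-contained NumPy sketch of the same 2-3-3-1 network trained by repeating the forward pass, backward pass, and update steps described in this post. The matrix layout and variable names are my own choices for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same network as in this post, written with one weight matrix per layer.
x = np.array([0.23, 0.55])
y = 0.01
W1 = np.array([[0.1, 0.2, 0.3],             # columns: n3, n4, n5
               [0.4, 0.5, 0.6]])
W2 = np.array([[0.7, 0.8, 0.9],             # rows: n3, n4, n5; columns: n6, n7, n8
               [0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
W3 = np.array([0.7, 0.8, 0.9])              # n6, n7, n8 -> n9
b1 = np.array([0.1, -0.4, 0.3])
b2 = np.array([-0.5, 0.5, 0.6])
b3 = -0.7
eta = 0.01

for step in range(1000):
    # Forward pass
    z1 = x @ W1 + b1
    h1 = relu(z1)                           # hidden layer 1 (n3, n4, n5)
    z2 = h1 @ W2 + b2
    h2 = relu(z2)                           # hidden layer 2 (n6, n7, n8)
    z3 = h2 @ W3 + b3
    y_hat = sigmoid(z3)                     # output node n9

    # Backward pass: dE/dnet for each layer
    d3 = (y_hat - y) * y_hat * (1 - y_hat)
    d2 = d3 * W3 * relu_grad(z2)
    d1 = (W2 @ d2) * relu_grad(z1)

    # Gradient-descent updates
    W3 -= eta * d3 * h2
    b3 -= eta * d3
    W2 -= eta * np.outer(h1, d2)
    b2 -= eta * d2
    W1 -= eta * np.outer(x, d1)
    b1 -= eta * d1

print(y_hat, 0.5 * (y - y_hat) ** 2)        # the error shrinks as y_hat moves toward y
```

Running this for more iterations, or with a larger learning rate, should push \hat{y} further toward the target and the error closer to zero.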

In this post, we've covered the mathematics behind a basic neural network, focusing on how the inputs, weights, and biases interact to produce the final output. We've walked through the process of forward propagation, calculating the output of each node, and applied the backpropagation algorithm to update the weights and biases, reducing the total error.

Until next time, signing off.